Abstract


Semantic matching is of central importance to many natural language tasks [2, 28].
A successful matching algorithm needs to adequately model the internal structures
of language objects and the interaction between them. As a step toward this goal,
we propose convolutional neural network models for matching two sentences, by
adapting the convolutional strategy in vision and speech. The proposed models
not only represent the hierarchical structures of sentences with their layer-by-layer
composition and pooling, but also capture the rich matching patterns at
different levels. Our models are rather generic, requiring no prior knowledge of
language, and can hence be applied to matching tasks of different nature and in
different languages. An empirical study on a variety of matching tasks demonstrates
the efficacy of the proposed models and their superiority to competitor models.


1 Introduction 

Matching two potentially heterogeneous language objects is central to many natural language
applications [2, 28]. It generalizes the conventional notion of similarity (e.g., in paraphrase identification
[19]) or relevance (e.g., in information retrieval [27]), since it aims to model the correspondence
between “linguistic objects” of different nature at different levels of abstraction. Examples include
top-k re-ranking in machine translation (e.g., comparing the meanings of a French sentence and an
English sentence) and dialogue (e.g., evaluating the appropriateness of a response to a given
utterance [26]).

Natural language sentences have complicated structures, both sequential and hierarchical, that are 
essential for understanding them. A successful sentence-matching algorithm therefore needs to 
capture not only the internal structures of sentences but also the rich patterns in their interactions. 
Towards this end, we propose deep neural network models, which adapt the convolutional strategy
(proven successful on images [11] and speech) to natural language. To further explore the relation
between representing sentences and matching them, we devise a novel model that can naturally
host both the hierarchical composition for sentences and the simple-to-comprehensive fusion of
matching patterns with the same convolutional architecture. Our model is generic, requiring no
prior knowledge of natural language (e.g., parse tree) and putting essentially no constraints on the
matching tasks. This is part of our continuing effort¹ in understanding natural language objects and
the matching between them [13, 26].

* This work was done while the first author was an intern at Noah’s Ark Lab, Huawei Technologies.

1 Our project page: http://www.noahlab.com.hk/technology/Learning2Match.html

Our main contributions can be summarized as follows. First, we devise novel deep convolutional
network architectures that can naturally combine 1) the hierarchical sentence modeling through
layer-by-layer composition and pooling, and 2) the capturing of the rich matching patterns at
different levels of abstraction. Second, we perform an extensive empirical study on tasks with different
scales and characteristics, and demonstrate the superior power of the proposed architectures over
competitor methods.

Roadmap We start by introducing a convolution network in Section 2 as the basic architecture for
sentence modeling, and how it is related to existing sentence models. Based on that, in Section 3,
we propose two architectures for sentence matching, with a detailed discussion of their relation. In
Section 4, we briefly discuss the learning of the proposed architectures. Then in Section 5, we report
our empirical study, followed by a brief discussion of related work in Section 6.

2 Convolutional Sentence Model 

We start by proposing a new convolutional architecture for modeling sentences. As illustrated
in Figure 1, it takes as input the embeddings of the words in the sentence (often trained beforehand
with unsupervised methods), aligned sequentially, and summarizes the meaning of the sentence through
layers of convolution and pooling, until reaching a fixed-length vectorial representation in the final
layer. As in most convolutional models [11], we use convolution units with a local “receptive
field” and shared weights, but we design a large feature map to adequately model the rich structures
in the composition of words.


[Figure 1 diagram labels: convolution; pooling; more convolution and pooling; fixed-length vector]

Figure 1: The overall architecture of the convolutional sentence model. A box with dashed lines
indicates all-zero padding turned off by the gating function (see the Length Variability paragraph in Section 2).


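Convolution As shown in Figure 1, the convolution in Layer-1 operates on sliding windows of
words (width $k_1$), and the convolutions in deeper layers are defined in a similar way. Generally, with
sentence input $x$, the convolution unit for feature map of type-$f$ (among $F_\ell$ of them) on Layer-$\ell$ is

$$z_i^{(\ell,f)} \overset{\text{def}}{=} z_i^{(\ell,f)}(x) = \sigma\big(w^{(\ell,f)} \hat{z}_i^{(\ell-1)} + b^{(\ell,f)}\big), \quad f = 1, 2, \cdots, F_\ell \qquad (1)$$

and its matrix form is $z_i^{(\ell)} \overset{\text{def}}{=} z_i^{(\ell)}(x) = \sigma\big(W^{(\ell)} \hat{z}_i^{(\ell-1)} + b^{(\ell)}\big)$, where

• $z_i^{(\ell,f)}(x)$ gives the output of feature map of type-$f$ for location $i$ in Layer-$\ell$;

• $w^{(\ell,f)}$ is the parameters for $f$ on Layer-$\ell$, with matrix form $W^{(\ell)} \overset{\text{def}}{=} [w^{(\ell,1)}, \cdots, w^{(\ell,F_\ell)}]$;

• $\sigma(\cdot)$ is the activation function (e.g., Sigmoid or ReLU [7]);

• $\hat{z}_i^{(\ell-1)}$ denotes the segment of Layer-$\ell{-}1$ for the convolution at location $i$, while

$$\hat{z}_i^{(0)} = x_{i:i+k_1-1} \overset{\text{def}}{=} [x_i^\top, x_{i+1}^\top, \cdots, x_{i+k_1-1}^\top]^\top$$

concatenates the vectors for $k_1$ (width of sliding window) words from sentence input $x$.

Max-Pooling We take a max-pooling in every two-unit window for every $f$, after each convolution

$$z_i^{(\ell,f)} = \max\big(z_{2i-1}^{(\ell-1,f)}, z_{2i}^{(\ell-1,f)}\big), \quad \ell = 2, 4, \cdots.$$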

The effects of pooling are two-fold: 1) it shrinks the size of the representation by half, and thus quickly
absorbs the differences in length for sentence representation, and 2) it filters out undesirable
compositions of words (see Section 2.1 for some analysis).

Length Variability The variable length of sentences in a fairly broad range can be readily handled
with the convolution and pooling strategy. More specifically, we put all-zero padding vectors after
the last word of the sentence until the maximum length. To eliminate the boundary effect caused
by the great variability of sentence lengths, we add to the convolutional unit a gate which sets the
output vectors to all-zeros if the input is all zeros. For any given sentence input $x$, the output of
type-$f$ filter for location $i$ in the $\ell$-th layer is given by

$$z_i^{(\ell,f)}(x) = g\big(\hat{z}_i^{(\ell-1)}\big) \cdot \sigma\big(w^{(\ell,f)} \hat{z}_i^{(\ell-1)} + b^{(\ell,f)}\big), \qquad (2)$$

where $g(v) = 0$ if all the elements in vector $v$ equal 0, otherwise $g(v) = 1$. This gate, working
with max-pooling and a positive activation function (e.g., Sigmoid), keeps away the artifacts from
padding in all layers. In fact, it creates a natural hierarchy of all-zero padding (as illustrated in
Figure 1), consisting of nodes in the neural net that do not contribute to the forward process (as
in prediction) or to backward propagation (as in learning).
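
As a minimal sketch of the gated convolution and two-unit max-pooling described above, the following NumPy code applies one convolution layer and one pooling layer to a zero-padded sentence. The function names, the ReLU activation, and the choice to slide the window only over positions where it fits entirely are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def gated_conv1d(z_prev, W, b, k):
    """Gated convolution over sliding windows of width k.

    z_prev: (n, d) array, one row per location of the previous layer;
            all-zero rows are padding.
    W:      (F, k * d) weights, one row per feature map.
    b:      (F,) biases.
    Returns an (n - k + 1, F) array.
    """
    n, d = z_prev.shape
    out = np.zeros((n - k + 1, W.shape[0]))
    for i in range(n - k + 1):
        z_hat = z_prev[i:i + k].reshape(-1)       # concatenate k vectors
        gate = 1.0 if z_hat.any() else 0.0        # g(v) = 0 iff v is all zeros
        out[i] = gate * np.maximum(0.0, W @ z_hat + b)  # ReLU stands in for sigma
    return out

def max_pool_pairs(z):
    """Max over non-overlapping two-unit windows, separately per feature map."""
    n = (z.shape[0] // 2) * 2                     # drop a trailing odd unit
    return z[:n].reshape(n // 2, 2, -1).max(axis=1)

# Hypothetical input: 4 real words followed by 4 all-zero padding vectors.
rng = np.random.default_rng(0)
sentence = np.vstack([rng.normal(size=(4, 3)), np.zeros((4, 3))])
W, b = rng.normal(size=(2, 9)), np.ones(2)
conv_out = gated_conv1d(sentence, W, b, k=3)
pooled = max_pool_pairs(conv_out)
```

Even with a positive bias, the windows that cover only padding produce all-zero outputs, so the padding stays switched off after pooling as well.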


2.1 Some Analysis on the Convolutional Architecture 

The convolutional unit, when combined with max-pooling, can act as a compositional operator
with a local selection mechanism, as in the recursive autoencoder. Figure 2 gives an example of
what could happen on the first two layers with the input sentence “The cat sat on the mat”.
Purely for illustration, we present a dramatic choice of parameters (by turning off some
elements in $W^{(1)}$) to make the convolution units focus on different segments within a 3-word
window. For example, some feature maps (group 2) give compositions for “the cat”
and “cat sat”, each being a vector. Different feature maps offer a variety of compositions, with
confidence encoded in the values (color coded in the output of the convolution layer in Figure 2). The
pooling then chooses, for each composition type, between two adjacent sliding windows, e.g., between
“on the” and “the mat” for feature maps group 2 from the rightmost two sliding windows.

Relation to Recursive Models Our convolutional model differs from Recurrent Neural Network
(RNN, [15]) and Recursive Auto-Encoder (RAE, [21]) in several important ways. First, unlike
RAE, it does not take a single path of word/phrase composition determined either by a separate
gating function, an external parser [19], or just natural sequential order [20]. Instead, it takes
multiple choices of composition via a large feature map (encoded in $w^{(\ell,f)}$ for different $f$), and
leaves the choices to the pooling afterwards to pick the more appropriate segments (in every adjacent
two) for each composition. With any window width $k_\ell \geq 3$, the type of composition would be much
richer than that of RAE. Second, our convolutional model can take supervised training and tune
the parameters for a specific task, a property vital to our supervised learning-to-match framework.
However, unlike recursive models, the convolutional architecture has a fixed depth, which
bounds the level of composition it can perform. For tasks like matching, this limitation can be largely
compensated for with a subsequent network that performs a “global” synthesis on the learned sentence
representation.

Relation to “Shallow” Convolutional Models The proposed convolutional sentence model takes
simple architectures such as [18, 10] (essentially the same convolutional architecture as SENNA)
as special cases; these consist of a convolution layer and a max-pooling over the entire sentence for
each feature map. Models of this type, with local convolutions and a global pooling, essentially perform
a “soft” local template matching and are able to detect local features useful for a certain task. Since the
sentence-level sequential order is inevitably lost in the global pooling, such a model is incapable of modeling
more complicated structures. It is not hard to see that our convolutional model degenerates to the
SENNA-type architecture if we limit the number of layers to two and set the pooling window
infinitely large.

Figure 2: The cat example, where in the convolution layer,
gray color indicates less confidence in composition.

3 Convolutional Matching Models

Based on the discussion in Section 2, we propose two related convolutional architectures, namely
Arc-I and Arc-II, for matching two sentences.

3.1 Architecture-I (Arc-I) 

Architecture-I (Arc-I), as illustrated in Figure 3, takes a conventional approach: it first finds
the representation of each sentence, and then compares the representations of the two sentences
with a multi-layer perceptron (MLP). It is essentially the Siamese architecture, which has been
applied to different tasks as a nonlinear similarity function [23]. Although Arc-I enjoys the
flexibility brought by the convolutional sentence model, it suffers from a drawback inherited from
the Siamese architecture: it defers the interaction between two sentences (in the final MLP) until
their individual representations mature (in the convolution model), and therefore runs the risk of
losing details (e.g., a city name) important for the matching task in representing the sentences.
In other words, in the forward phase (prediction), the representation of each sentence is formed
without knowledge of the other. This cannot be adequately circumvented in the backward phase
(learning), when the convolutional model learns to extract structures informative for matching
on a population level.

Figure 3: Architecture-I for matching two sentences.
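
The Siamese structure of Arc-I can be sketched as follows, under simplifying assumptions: a single gated convolution layer with pairwise max-pooling stands in for the full sentence model, both sentences share the same convolution weights, and the MLP has one hidden layer. The names `encode` and `arc1_score` and all shapes are hypothetical.

```python
import numpy as np

def encode(x, W, b, k=3):
    """One gated convolution layer plus pairwise max-pooling, flattened."""
    n = x.shape[0]
    windows = np.stack([x[i:i + k].reshape(-1) for i in range(n - k + 1)])
    gate = windows.any(axis=1, keepdims=True)     # False on all-padding windows
    h = gate * np.maximum(0.0, windows @ W.T + b)
    m = (h.shape[0] // 2) * 2
    return h[:m].reshape(m // 2, 2, -1).max(axis=1).reshape(-1)

def arc1_score(x, y, conv, mlp):
    """Siamese scoring: the two sentences interact only inside the MLP."""
    W, b = conv
    (W1, b1), (w2, b2) = mlp
    rep = np.concatenate([encode(x, W, b), encode(y, W, b)])
    hidden = np.maximum(0.0, W1 @ rep + b1)
    return float(w2 @ hidden + b2)

# Hypothetical padded inputs: 8 positions, 3-dimensional embeddings.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(size=(5, 3)), np.zeros((3, 3))])
y = np.vstack([rng.normal(size=(6, 3)), np.zeros((2, 3))])
conv = (rng.normal(size=(4, 9)), np.zeros(4))
mlp = ((rng.normal(size=(8, 24)), np.zeros(8)), (rng.normal(size=8), 0.0))
score = arc1_score(x, y, conv, mlp)
```

Note that `encode(x, ...)` never sees `y`: this is the deferred interaction discussed in the paragraph above.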


3.2 Architecture-II (Arc-II) 

In view of the drawback of Architecture-I, we propose Architecture-II (Arc-II) that is built directly
on the interaction space between two sentences. It has the desirable property of letting two sentences
meet before their own high-level representations mature, while still retaining the space for the
individual development of abstraction of each sentence. Basically, in Layer-1, we take sliding windows
on both sentences, and model all the possible combinations of them through “one-dimensional” (1D)
convolutions. For segment $i$ on $S_X$ and segment $j$ on $S_Y$, we have the feature map

$$z_{i,j}^{(1,f)} \overset{\text{def}}{=} z_{i,j}^{(1,f)}(x, y) = g\big(\hat{z}_{i,j}^{(0)}\big) \cdot \sigma\big(w^{(1,f)} \hat{z}_{i,j}^{(0)} + b^{(1,f)}\big), \qquad (3)$$

where $\hat{z}_{i,j}^{(0)} \in \mathbb{R}^{2 k_1 D_e}$ simply concatenates the vectors for sentence segments for $S_X$ and $S_Y$:

$$\hat{z}_{i,j}^{(0)} = [x_{i:i+k_1-1}^\top, \; y_{j:j+k_1-1}^\top]^\top.$$

Clearly the 1D convolution preserves the location information about both segments. After that, in
Layer-2, it performs a 2D max-pooling in non-overlapping $2 \times 2$ windows (illustrated in Figure 5):

$$z_{i,j}^{(2,f)} = \max\big(\{ z_{2i-1,2j-1}^{(1,f)}, \; z_{2i-1,2j}^{(1,f)}, \; z_{2i,2j-1}^{(1,f)}, \; z_{2i,2j}^{(1,f)} \}\big). \qquad (4)$$

In Layer-3, we perform a 2D convolution on $k_3 \times k_3$ windows of output from Layer-2:

$$z_{i,j}^{(3,f)} = g\big(\hat{z}_{i,j}^{(2)}\big) \cdot \sigma\big(W^{(3,f)} \hat{z}_{i,j}^{(2)} + b^{(3,f)}\big). \qquad (5)$$

This could go on for more layers of 2D convolution and 2D max-pooling, analogous to the
convolutional architecture for image input [11].


The 2D-Convolution After the first convolution, we obtain a low-level representation of the
interaction between the two sentences, and from then on we obtain a high-level representation
$z_{i,j}^{(\ell)}$ which encodes the information from both sentences. The general two-dimensional
convolution is formulated as

$$z_{i,j}^{(\ell)} = g\big(\hat{z}_{i,j}^{(\ell-1)}\big) \cdot \sigma\big(W^{(\ell)} \hat{z}_{i,j}^{(\ell-1)} + b^{(\ell)}\big), \quad \ell = 3, 5, \cdots, \qquad (6)$$

where $\hat{z}_{i,j}^{(\ell-1)}$ concatenates the corresponding vectors from its 2D receptive field in Layer-$\ell{-}1$. The
2D pooling has a different mechanism from that in the 1D case, for it selects not only among compositions on
different segments but also among different local matchings. This pooling strategy resembles the
dynamic pooling in [19] in a similarity learning context, but with two distinctions: 1) it happens on
a fixed architecture and 2) it has much richer structure than just similarity.
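
The first two layers of Arc-II can be sketched in NumPy as below: a 1D convolution over every pair of sliding windows, in the spirit of Eq. (3), followed by a max over non-overlapping 2x2 windows as in Eq. (4). This is only an illustration under stated assumptions: ReLU is used as the activation, the function names are hypothetical, and trailing odd rows or columns are simply dropped before pooling.

```python
import numpy as np

def arc2_layer1(x, y, W, b, k=3):
    """1D convolution over all pairs of sliding windows.

    x: (n_x, d) and y: (n_y, d) padded word embeddings.
    W: (F, 2 * k * d) weights, b: (F,) biases.
    Returns an (n_x - k + 1, n_y - k + 1, F) array.
    """
    xs = [x[i:i + k].reshape(-1) for i in range(x.shape[0] - k + 1)]
    ys = [y[j:j + k].reshape(-1) for j in range(y.shape[0] - k + 1)]
    out = np.zeros((len(xs), len(ys), W.shape[0]))
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            z_hat = np.concatenate([xi, yj])      # segment pair (i, j)
            if z_hat.any():                       # gate g: skip all-zero input
                out[i, j] = np.maximum(0.0, W @ z_hat + b)
    return out

def max_pool_2x2(z):
    """Max over non-overlapping 2x2 windows, separately per feature map."""
    a, c = (z.shape[0] // 2) * 2, (z.shape[1] // 2) * 2
    z = z[:a, :c]
    return z.reshape(a // 2, 2, c // 2, 2, -1).max(axis=(1, 3))

# Hypothetical inputs: 8 and 6 positions, 3-dimensional embeddings, 5 feature maps.
rng = np.random.default_rng(2)
x, y = rng.normal(size=(8, 3)), rng.normal(size=(6, 3))
W, b = rng.normal(size=(5, 18)), np.zeros(5)
z1 = arc2_layer1(x, y, W, b)
z2 = max_pool_2x2(z1)
```

The output keeps two location indices, one per sentence, which is what later 2D convolutions operate on.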

[Figure 4 diagram labels: Layer-1 (1D convolution); Layer-2 (2D-pooling); Layer-3 (2D-convolution)]

Figure 4: Architecture-II (Arc-II) of convolutional matching model


3.3 Some Analysis on Arc-II 

Order Preservation Both the convolution and pooling operations in Architecture-II have an
order-preserving property. Generally, $z_{i,j}^{(\ell)}$ contains information about the words in $S_X$
before those in $z_{i+1,j}^{(\ell)}$, although they may be generated with slightly different segments in
$S_Y$, due to the 2D pooling (illustrated in Figure 5). The order is, however, retained in a
“conditional” sense. Our experiments show that when Arc-II is trained on $(S_X, S_Y, \tilde{S}_Y)$
triples, where $\tilde{S}_Y$ randomly shuffles the words in $S_Y$, it consistently gains some ability to
find the correct $S_Y$ in the usual contrastive negative sampling setting, which, however, does not
happen with Arc-I.

Model Generality It is not hard to show that Arc-II actually subsumes Arc-I as a special case.
Indeed, in Arc-II if we choose (by turning off some parameters in $W^{(\ell,\cdot)}$) to keep the
representations of the two sentences separated until the final MLP, Arc-II can act fully like Arc-I,
as illustrated in Figure 6. More specifically, if we let the feature maps in the first convolution layer
be either devoted to $S_X$ or devoted to $S_Y$ (instead of taking both as in the general case), the output
of each segment-pair is naturally divided into two corresponding groups. As a result, the output for
each filter $f$, denoted $z^{(1,f)}_{1:n,1:n}$ ($n$ is the number of sliding windows), will be of rank one, possessing
essentially the same information as the result of the first convolution layer in Arc-I. Clearly the 2D
pooling that follows will reduce to 1D pooling, with this separateness preserved. If we further limit
the parameters in the second convolution units (more specifically $w^{(2,f)}$) to those for $S_X$ and $S_Y$,
we can ensure the individual development of different levels of abstraction on each side, and fully
recover the functionality of Arc-I.
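
The rank-one claim can be checked numerically with a small sketch. It assumes a sigmoid activation and unpadded inputs (so the gate is always 1); the function name `layer1_map` and the sizes are hypothetical. When the weights on the $S_Y$ segment are turned off, every column of the Layer-1 output for that filter is identical, so the matrix has rank one.

```python
import numpy as np

def layer1_map(x, y, w, b, k=3):
    """Layer-1 output of a single feature map for every segment pair (i, j)."""
    xs = [x[i:i + k].reshape(-1) for i in range(x.shape[0] - k + 1)]
    ys = [y[j:j + k].reshape(-1) for j in range(y.shape[0] - k + 1)]
    z = np.empty((len(xs), len(ys)))
    for i, xi in enumerate(xs):
        for j, yj in enumerate(ys):
            z_hat = np.concatenate([xi, yj])
            z[i, j] = 1.0 / (1.0 + np.exp(-(w @ z_hat + b)))  # sigmoid
    return z

rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 4)), rng.normal(size=(7, 4))
w_full = rng.normal(size=24)          # 2 * k * d weights, using both segments
w_x_only = w_full.copy()
w_x_only[12:] = 0.0                   # turn off the weights on the S_Y segment

rank_x_only = np.linalg.matrix_rank(layer1_map(x, y, w_x_only, 0.0))
rank_full = np.linalg.matrix_rank(layer1_map(x, y, w_full, 0.0))
```

With the restricted weights the entry at $(i, j)$ depends on $i$ alone, which is the same information a 1D convolution over $S_X$ would carry.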

Figure 5: Order preserving in 2D-pooling.

[Figure 6 diagram labels: Layer-1 (1D-convolution); Layer-2 (2D-pooling); Layer-3 (2D-conv.); final representation]

Figure 6: Arc-I as a special case of Arc-II. Better viewed in color.

As suggested by the order-preserving property and the generality of Arc-II, this architecture offers
not only the capability but also the inductive bias for the individual development of internal
abstraction on each sentence, despite the fact that it is built on the interaction between two sentences. As
a result, Arc-II can naturally blend two seemingly diverging processes: 1) the successive
composition within each sentence, and 2) the extraction and fusion of matching patterns between them,
and is hence powerful for matching linguistic objects with rich structures. This intuition is verified by
the superior performance of Arc-II in experiments (Section 5) on different matching tasks.

4 Training 

We employ a discriminative training strategy with a large margin objective. Suppose that we are
given the following triples $(x, y^+, y^-)$ from the oracle, with $x$ matched with $y^+$ better than with
$y^-$. We have the following ranking-based loss as objective:

$$e(x, y^+, y^-; \Theta) = \max\big(0, \; 1 + s(x, y^-) - s(x, y^+)\big),$$

where $s(x, y)$ is the predicted matching score for $(x, y)$, and $\Theta$ includes the parameters for the convolution
layers and those for the MLP. The optimization is relatively straightforward for both architectures
with standard back-propagation. The gating function (see Section 2) can be easily incorporated into
the gradient by discounting the contribution from convolution units that have been turned off by
the gating function. We use stochastic gradient descent for the optimization of the
models. All the proposed models perform better with mini-batches (100∼200 in size), which can
be easily parallelized on a single multi-core machine. For regularization, we find that for both
architectures, early stopping is enough for models of medium size and large training sets
(with over 500K instances). For small datasets (fewer than 10K training instances), however, we have
to combine early stopping and dropout [8] to deal with the serious overfitting problem.
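
The ranking loss above and its subgradient with respect to the two scores can be written in a few lines. This is a minimal sketch: the function names are hypothetical, and the scores are assumed to be already computed by the matching model.

```python
import numpy as np

def ranking_loss(s_pos, s_neg, margin=1.0):
    """Hinge ranking loss max(0, margin + s(x, y-) - s(x, y+)), per triple."""
    return np.maximum(0.0, margin + np.asarray(s_neg) - np.asarray(s_pos))

def ranking_loss_grad(s_pos, s_neg, margin=1.0):
    """Subgradient w.r.t. (s_pos, s_neg); nonzero only for violated triples."""
    active = (margin + np.asarray(s_neg) - np.asarray(s_pos)) > 0
    return -active.astype(float), active.astype(float)

# Two hypothetical triples: the first satisfies the margin, the second does not.
s_pos = np.array([2.0, 0.2])
s_neg = np.array([0.5, 0.5])
losses = ranking_loss(s_pos, s_neg)
grad_pos, grad_neg = ranking_loss_grad(s_pos, s_neg)
```

Triples that already satisfy the margin contribute neither loss nor gradient, so only violated triples drive the parameter updates.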

We use 50-dimensional word embeddings trained with Word2Vec [14]: the embedding for English
words (Sections 5.2 & 5.4) is learnt on Wikipedia (∼1B words), while that for Chinese words (Section
5.3) is learnt on Weibo data (∼300M words). Our other experiments (results omitted here) suggest
that fine-tuning the word embedding can further improve the performance of all models, at the cost
of longer training. We vary the maximum sentence length (in words) across tasks to accommodate the
longest sentence of each task. We use a 3-word window throughout all experiments², but test various numbers of feature
maps (typically from 200 to 500) for optimal performance. Arc-II models for all tasks have eight
layers (three for convolution, three for pooling, and two for MLP), while Arc-I performs better
with fewer layers (two for convolution, two for pooling, and two for MLP) and more hidden nodes.
We use ReLU [7] as the activation function for all models (convolution and MLP), which yields
results comparable to or better than sigmoid-like functions, but converges faster.

5 Experiments 

We report the performance of the proposed models on three matching tasks of different nature, and
compare it with that of other competitor models. Among them, the first two tasks (namely, Sentence
Completion and Tweet-Response Matching) are about matching language objects of heterogeneous
natures, while the third one (paraphrase identification) is a natural example of matching
homogeneous objects. Moreover, the three tasks involve two languages, different types of matching, and
distinctive writing styles, demonstrating the broad applicability of the proposed models.

5.1 Competitor Methods 

• WordEmbed: We first represent each short-text as the sum of the embeddings of the
words it contains. The matching score of two short-texts is calculated with an MLP with
the embeddings of the two documents as input;

• DeepMatch: We take the matching model in [13] and train it on our datasets with 3
hidden layers and 1,000 hidden nodes in the first hidden layer;

• uRAE+MLP: We use the Unfolding Recursive Autoencoder [19]³ to get a 100-dimensional
vector representation of each sentence, and put an MLP on the top as in
WordEmbed;

• SENNA+MLP/sim: We use the SENNA-type sentence model for sentence representation;

• SenMLP: We take the whole sentence as input (with word embeddings aligned sequentially),
and use an MLP to obtain the score of coherence.

All the competitor models are trained on the same training set as the proposed models, and we report
the best test performance over different choices of models (e.g., the number and size of hidden layers
in MLP).

2 Our other experiments suggest that the performance can be further increased with wider windows.

3 Code from: http://nlp.stanford.edu/~socherr/classifyParaphrases.zip
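
For concreteness, the sum-of-embeddings representation used by the WordEmbed baseline can be sketched as below. The toy vocabulary, the 3-dimensional vectors, and the choice to map unknown words to a zero vector are assumptions for illustration only.

```python
import numpy as np

def sum_embedding(tokens, embeddings, dim):
    """Represent a short text as the sum of its word vectors."""
    vec = np.zeros(dim)
    for token in tokens:
        # Unknown words contribute a zero vector (an assumption of this sketch).
        vec += embeddings.get(token, np.zeros(dim))
    return vec

# Hypothetical 3-dimensional embeddings.
embeddings = {
    "cat": np.array([1.0, 0.0, 2.0]),
    "sat": np.array([0.5, 1.0, 0.0]),
    "mat": np.array([0.0, 2.0, 1.0]),
}
a = sum_embedding(["cat", "sat", "mat"], embeddings, dim=3)
b = sum_embedding(["mat", "sat", "cat"], embeddings, dim=3)
```

The two word orders give the same vector, which is why this baseline counts as a bag-of-words model.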

5.2 Experiment I: Sentence Completion 

This is an artificial task designed to elucidate how different matching models can capture the
correspondence between two clauses within a sentence. Basically, we take a sentence from
Reuters [12] with two “balanced” clauses (with 28 words) divided by one comma, and use the
first clause as $S_X$ and the second as $S_Y$. The task is then to recover the original second clause for
any given first clause. The matching here is considered heterogeneous since the relation between the
two is nonsymmetrical on both lexical and semantic levels. We deliberately make the task harder
by using negative second clauses similar to the original one⁴, both in training and testing. One
representative example is given as follows:

$S_X$: Although the state has only four votes in the Electoral College,

$S_Y^+$: its loss would be a symbolic blow to republican presidential candidate Bob Dole.

$S_Y^-$: but it failed to garner enough votes to override an expected veto by president Clinton.

All models are trained on 3 million triples (from 600K positive pairs), and tested on 50K positive
pairs, each accompanied by four negatives, with results shown in Table 1. The two proposed models
get nearly half of the cases right⁵, with a large margin over other sentence models and models
without explicit sequence modeling. Arc-II outperforms Arc-I significantly, showing the power of
joint modeling of matching and sentence meaning. As another convolutional model, SENNA+MLP performs fairly well
on this task, although still running behind the proposed convolutional architectures since it is too
shallow to adequately model the sentence. It is a bit surprising that uRAE comes last on this task,
which might be caused by the fact that 1) the representation model (including word-embedding) is
not trained on Reuters, and 2) the split-sentence setting hurts the parsing, which is vital to the quality
of the learned sentence representation.
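
The P@1 figures reported in the tables can be computed as in the sketch below, assuming each test case stores the score of the positive candidate in column 0 followed by the scores of its negatives. The function name and the rule that ties count as errors are assumptions of this sketch.

```python
import numpy as np

def precision_at_1(scores):
    """P@1 in percent.

    scores: (num_cases, 1 + num_negatives) array; column 0 holds the score of
    the positive candidate, the remaining columns hold the negatives.
    """
    scores = np.asarray(scores, dtype=float)
    correct = scores[:, 0] > scores[:, 1:].max(axis=1)   # ties count as errors
    return 100.0 * correct.mean()

# Two hypothetical test cases, each with one positive and four negatives.
example = [
    [0.9, 0.1, 0.2, 0.3, 0.4],   # positive ranked first
    [0.2, 0.5, 0.1, 0.1, 0.1],   # a negative outranks the positive
]
p_at_1 = precision_at_1(example)
```

With one positive among five candidates, picking uniformly at random is right one time in five, which matches the 20.00 listed for Random Guess.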

5.3 Experiment II: Matching A Response to A Tweet 

We trained our model with 4.5 million original (tweet, response) pairs collected from Weibo, a
major Chinese microblog service [26]. Compared to Experiment I, the writing style is clearly
freer and more informal. For each positive pair, we find ten random responses as negative
examples, rendering 45 million triples for training. One example (translated to English) is given
below, with $S_X$ standing for the tweet, $S_Y^+$ the original response, and $S_Y^-$
the randomly selected response:

$S_X$: Damn, I have to work overtime this weekend!

$S_Y^+$: Try to have some rest buddy.

$S_Y^-$: It is hard to find a job, better start polishing your resume.
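
The construction of training triples from positive pairs with random negatives can be sketched as follows. The function name, the fixed seed, and the rule of excluding only the pair's own response are assumptions; the quadratic candidate filtering is for clarity and would not scale to millions of pairs.

```python
import random

def make_triples(pairs, num_negatives, seed=0):
    """Pair every (x, y_pos) with randomly drawn responses of other pairs."""
    rng = random.Random(seed)
    responses = [y for _, y in pairs]
    triples = []
    for x, y_pos in pairs:
        candidates = [y for y in responses if y != y_pos]
        for y_neg in rng.sample(candidates, num_negatives):
            triples.append((x, y_pos, y_neg))
    return triples

# Hypothetical toy data: 12 pairs with 10 negatives each give 120 triples,
# mirroring 4.5 million pairs x 10 negatives = 45 million triples.
pairs = [(f"tweet {i}", f"response {i}") for i in range(12)]
triples = make_triples(pairs, num_negatives=10)
```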

We hold out 300K original (tweet, response) pairs and test the matching models on their ability to
pick the original response from four random negatives, with results reported in Table 2. This task
is slightly easier than Experiment I, with more training instances and purely random negatives. It
demands less grammatical rigor but more detailed modeling of loose and local matching
patterns (e.g., work-overtime ⇔ rest). Again Arc-II beats other models by large margins,
while the two convolutional sentence models Arc-I and SENNA+MLP come next.


4 We select from a random set the clauses that have 0.7∼0.8 cosine similarity with the original. The dataset
and more information can be found at http://www.noahlab.com.hk/technology/Learning2Match.html

5 Actually Arc-II can achieve 74+% accuracy with random negatives.


Model            P@1 (%)
Random Guess     20.00
DeepMatch        49.85
WordEmbed        54.31
SenMLP           52.22
SENNA+MLP        56.48
Arc-I            59.18
Arc-II           61.95

Table 2: Tweet Matching.


Model            P@1 (%)
Random Guess     20.00
DeepMatch        32.50
WordEmbed        37.63
SenMLP           36.14
SENNA+MLP        41.56
uRAE+MLP         25.76
Arc-I            47.51
Arc-II           49.62

Table 1: Sentence Completion.

5.4 Experiment III: Paraphrase Identification

Paraphrase identification aims to determine whether two sentences have the same meaning,
a problem considered a touchstone of natural language understanding. This experiment
is included to test our methods on matching homogeneous objects. Here we use the benchmark
MSRP dataset [17], which contains 4,076 instances for training and 1,725 for
test. We use all the training instances and report the test performance from early stopping.
As stated earlier, our model is not specially tailored for modeling synonymy,
and generally requires >100K instances to work favorably. Nevertheless, our generic matching
models still manage to perform reasonably well, achieving an accuracy and F1 score close to
the best performer in 2008 based on hand-crafted features, but still significantly
lower than the state-of-the-art (76.8%/83.6%), achieved
with unfolding-RAE and other features designed for this task [19].
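
Accuracy and F1 as reported for this task can be computed as in the sketch below, with label 1 meaning "paraphrase". The function name and the toy labels are hypothetical. As a sanity check on the scale of the F1 numbers, a classifier that labels every pair as a paraphrase on a test set with 66.5% positives obtains 66.5% accuracy and an F1 of roughly 79.9%; this is close to the Baseline row of Table 3, although the text does not state how that baseline was obtained.

```python
def accuracy_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Hypothetical test set with 66.5% positives, and an all-positive classifier.
y_true = [1] * 665 + [0] * 335
y_pred = [1] * 1000
acc, f1 = accuracy_f1(y_true, y_pred)
```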

5.5 Discussions 

Arc-II outperforms others significantly when the training instances are relatively abundant (as in
Experiment I & II). Its superiority over Arc-I, however, is less salient when the sentences have deep
grammatical structures and the matching relies less on the local matching patterns, as in Experiment I.
This therefore raises the interesting question about how to balance the representation of matching
and the representations of objects, and whether we can guide the learning process through something
like curriculum learning [4].

As another important observation, convolutional models (Arc-I & II, SENNA+MLP) perform
favorably over bag-of-words models, indicating the importance of utilizing sequential structures in
understanding and matching sentences. Quite interestingly, as shown by our other experiments,
Arc-I and Arc-II trained purely with random negatives automatically gain some ability in telling
whether the words in a given sentence are in the right sequential order (with around 60% accuracy for
both). It is therefore a bit surprising that an auxiliary task on identifying the correctness of word
order in the response does not enhance the ability of the model on the original matching tasks.

We noticed that a simple sum of embeddings learned via Word2Vec [14] yields reasonably good results
on all three tasks. We hypothesize that the Word2Vec embedding is trained in such a way that the
vector summation can act as a simple composition, and hence retains a fair amount of meaning in
the short text segment. This is in contrast with other bag-of-words models like DeepMatch [13].

6 Related Work 

Matching structured objects rarely goes beyond estimating the similarity of objects in the same
domain, with few exceptions. When dealing with language objects, most
methods still focus on seeking vectorial representations in a common latent space, and calculating
the matching score with inner product [18, 25]. Little work has been done on building a deep
architecture on the interaction space for text pairs, and what exists is largely based on a
bag-of-words representation of text [13].

Our models are related to the long thread of work on sentence representation. Aside from the models
with recursive nature [15, 21, 19] (as discussed in Section 2.1), it is fairly common practice to use
the sum of word-embeddings to represent a short-text, mostly for classification [22]. There is very
little work on convolutional modeling of language. In addition to [6, 18], there is a very recent model
on sentence representation with a dynamic convolutional neural network. This work relies heavily
on a carefully designed pooling strategy to handle the variable length of sentences with a relatively
small feature map, tailored for classification problems with modest sizes.

7 Conclusion 

We propose deep convolutional architectures for matching natural language sentences, which can
combine the hierarchical modeling of individual sentences and the patterns of their matching.
Our empirical study shows that the models can outperform competitors on a variety of matching tasks.

Acknowledgments: B. Hu and Q. Chen are supported in part by National Natural Science Foundation of 
China 61173075. Z. Lu and H. Li are supported in part by China National 973 project 2014CB340301. 


Model               Acc. (%)   F1 (%)
Baseline            66.5       79.90
Rus et al. (2008)   70.6       80.50
WordEmbed           68.7       80.49
SENNA+MLP           68.4       79.70
SenMLP              68.4       79.50
Arc-I               69.6       80.27
Arc-II              69.9       80.91

Table 3: The results on Paraphrase.